The dataset is offered in two separate files, one for the training set and another for the test set.
original_training_data = read.csv(file = file.path("Term 2 ie/Machine Learning/Assignment 1 /house-prices-advanced-regression-techniques/train.csv"))
original_test_data = read.csv(file = file.path("Term 2 ie/Machine Learning/Assignment 1 /house-prices-advanced-regression-techniques/test.csv"))
original_test_data$SalePrice <- 0
dataset <- rbind(original_training_data, original_test_data)
dataset <- as.data.frame(dataset)
summary(dataset)
## Id MSSubClass MSZoning LotFrontage
## Min. : 1.0 Min. : 20.00 C (all): 25 Min. : 21.00
## 1st Qu.: 730.5 1st Qu.: 20.00 FV : 139 1st Qu.: 59.00
## Median :1460.0 Median : 50.00 RH : 26 Median : 68.00
## Mean :1460.0 Mean : 57.14 RL :2265 Mean : 69.31
## 3rd Qu.:2189.5 3rd Qu.: 70.00 RM : 460 3rd Qu.: 80.00
## Max. :2919.0 Max. :190.00 NA's : 4 Max. :313.00
## NA's :486
## LotArea Street Alley LotShape LandContour
## Min. : 1300 Grvl: 12 Grvl: 120 IR1: 968 Bnk: 117
## 1st Qu.: 7478 Pave:2907 Pave: 78 IR2: 76 HLS: 120
## Median : 9453 NA's:2721 IR3: 16 Low: 60
## Mean : 10168 Reg:1859 Lvl:2622
## 3rd Qu.: 11570
## Max. :215245
##
## Utilities LotConfig LandSlope Neighborhood Condition1
## AllPub:2916 Corner : 511 Gtl:2778 NAmes : 443 Norm :2511
## NoSeWa: 1 CulDSac: 176 Mod: 125 CollgCr: 267 Feedr : 164
## NA's : 2 FR2 : 85 Sev: 16 OldTown: 239 Artery : 92
## FR3 : 14 Edwards: 194 RRAn : 50
## Inside :2133 Somerst: 182 PosN : 39
## NridgHt: 166 RRAe : 28
## (Other):1428 (Other): 35
## Condition2 BldgType HouseStyle OverallQual
## Norm :2889 1Fam :2425 1Story :1471 Min. : 1.000
## Feedr : 13 2fmCon: 62 2Story : 872 1st Qu.: 5.000
## Artery : 5 Duplex: 109 1.5Fin : 314 Median : 6.000
## PosA : 4 Twnhs : 96 SLvl : 128 Mean : 6.089
## PosN : 4 TwnhsE: 227 SFoyer : 83 3rd Qu.: 7.000
## RRNn : 2 2.5Unf : 24 Max. :10.000
## (Other): 2 (Other): 27
## OverallCond YearBuilt YearRemodAdd RoofStyle
## Min. :1.000 Min. :1872 Min. :1950 Flat : 20
## 1st Qu.:5.000 1st Qu.:1954 1st Qu.:1965 Gable :2310
## Median :5.000 Median :1973 Median :1993 Gambrel: 22
## Mean :5.565 Mean :1971 Mean :1984 Hip : 551
## 3rd Qu.:6.000 3rd Qu.:2001 3rd Qu.:2004 Mansard: 11
## Max. :9.000 Max. :2010 Max. :2010 Shed : 5
##
## RoofMatl Exterior1st Exterior2nd MasVnrType
## CompShg:2876 VinylSd:1025 VinylSd:1014 BrkCmn : 25
## Tar&Grv: 23 MetalSd: 450 MetalSd: 447 BrkFace: 879
## WdShake: 9 HdBoard: 442 HdBoard: 406 None :1742
## WdShngl: 7 Wd Sdng: 411 Wd Sdng: 391 Stone : 249
## ClyTile: 1 Plywood: 221 Plywood: 270 NA's : 24
## Membran: 1 (Other): 369 (Other): 390
## (Other): 2 NA's : 1 NA's : 1
## MasVnrArea ExterQual ExterCond Foundation BsmtQual
## Min. : 0.0 Ex: 107 Ex: 12 BrkTil: 311 Ex : 258
## 1st Qu.: 0.0 Fa: 35 Fa: 67 CBlock:1235 Fa : 88
## Median : 0.0 Gd: 979 Gd: 299 PConc :1308 Gd :1209
## Mean : 102.2 TA:1798 Po: 3 Slab : 49 TA :1283
## 3rd Qu.: 164.0 TA:2538 Stone : 11 NA's: 81
## Max. :1600.0 Wood : 5
## NA's :23
## BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2
## Fa : 104 Av : 418 ALQ :429 Min. : 0.0 ALQ : 52
## Gd : 122 Gd : 276 BLQ :269 1st Qu.: 0.0 BLQ : 68
## Po : 5 Mn : 239 GLQ :849 Median : 368.5 GLQ : 34
## TA :2606 No :1904 LwQ :154 Mean : 441.4 LwQ : 87
## NA's: 82 NA's: 82 Rec :288 3rd Qu.: 733.0 Rec : 105
## Unf :851 Max. :5644.0 Unf :2493
## NA's: 79 NA's :1 NA's: 80
## BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating
## Min. : 0.00 Min. : 0.0 Min. : 0.0 Floor: 1
## 1st Qu.: 0.00 1st Qu.: 220.0 1st Qu.: 793.0 GasA :2874
## Median : 0.00 Median : 467.0 Median : 989.5 GasW : 27
## Mean : 49.58 Mean : 560.8 Mean :1051.8 Grav : 9
## 3rd Qu.: 0.00 3rd Qu.: 805.5 3rd Qu.:1302.0 OthW : 2
## Max. :1526.00 Max. :2336.0 Max. :6110.0 Wall : 6
## NA's :1 NA's :1 NA's :1
## HeatingQC CentralAir Electrical X1stFlrSF X2ndFlrSF
## Ex:1493 N: 196 FuseA: 188 Min. : 334 Min. : 0.0
## Fa: 92 Y:2723 FuseF: 50 1st Qu.: 876 1st Qu.: 0.0
## Gd: 474 FuseP: 8 Median :1082 Median : 0.0
## Po: 3 Mix : 1 Mean :1160 Mean : 336.5
## TA: 857 SBrkr:2671 3rd Qu.:1388 3rd Qu.: 704.0
## NA's : 1 Max. :5095 Max. :2065.0
##
## LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath
## Min. : 0.000 Min. : 334 Min. :0.0000 Min. :0.00000
## 1st Qu.: 0.000 1st Qu.:1126 1st Qu.:0.0000 1st Qu.:0.00000
## Median : 0.000 Median :1444 Median :0.0000 Median :0.00000
## Mean : 4.694 Mean :1501 Mean :0.4299 Mean :0.06136
## 3rd Qu.: 0.000 3rd Qu.:1744 3rd Qu.:1.0000 3rd Qu.:0.00000
## Max. :1064.000 Max. :5642 Max. :3.0000 Max. :2.00000
## NA's :2 NA's :2
## FullBath HalfBath BedroomAbvGr KitchenAbvGr
## Min. :0.000 Min. :0.0000 Min. :0.00 Min. :0.000
## 1st Qu.:1.000 1st Qu.:0.0000 1st Qu.:2.00 1st Qu.:1.000
## Median :2.000 Median :0.0000 Median :3.00 Median :1.000
## Mean :1.568 Mean :0.3803 Mean :2.86 Mean :1.045
## 3rd Qu.:2.000 3rd Qu.:1.0000 3rd Qu.:3.00 3rd Qu.:1.000
## Max. :4.000 Max. :2.0000 Max. :8.00 Max. :3.000
##
## KitchenQual TotRmsAbvGrd Functional Fireplaces FireplaceQu
## Ex : 205 Min. : 2.000 Typ :2717 Min. :0.0000 Ex : 43
## Fa : 70 1st Qu.: 5.000 Min2 : 70 1st Qu.:0.0000 Fa : 74
## Gd :1151 Median : 6.000 Min1 : 65 Median :1.0000 Gd : 744
## TA :1492 Mean : 6.452 Mod : 35 Mean :0.5971 Po : 46
## NA's: 1 3rd Qu.: 7.000 Maj1 : 19 3rd Qu.:1.0000 TA : 592
## Max. :15.000 (Other): 11 Max. :4.0000 NA's:1420
## NA's : 2
## GarageType GarageYrBlt GarageFinish GarageCars
## 2Types : 23 Min. :1895 Fin : 719 Min. :0.000
## Attchd :1723 1st Qu.:1960 RFn : 811 1st Qu.:1.000
## Basment: 36 Median :1979 Unf :1230 Median :2.000
## BuiltIn: 186 Mean :1978 NA's: 159 Mean :1.767
## CarPort: 15 3rd Qu.:2002 3rd Qu.:2.000
## Detchd : 779 Max. :2207 Max. :5.000
## NA's : 157 NA's :159 NA's :1
## GarageArea GarageQual GarageCond PavedDrive WoodDeckSF
## Min. : 0.0 Ex : 3 Ex : 3 N: 216 Min. : 0.00
## 1st Qu.: 320.0 Fa : 124 Fa : 74 P: 62 1st Qu.: 0.00
## Median : 480.0 Gd : 24 Gd : 15 Y:2641 Median : 0.00
## Mean : 472.9 Po : 5 Po : 14 Mean : 93.71
## 3rd Qu.: 576.0 TA :2604 TA :2654 3rd Qu.: 168.00
## Max. :1488.0 NA's: 159 NA's: 159 Max. :1424.00
## NA's :1
## OpenPorchSF EnclosedPorch X3SsnPorch ScreenPorch
## Min. : 0.00 Min. : 0.0 Min. : 0.000 Min. : 0.00
## 1st Qu.: 0.00 1st Qu.: 0.0 1st Qu.: 0.000 1st Qu.: 0.00
## Median : 26.00 Median : 0.0 Median : 0.000 Median : 0.00
## Mean : 47.49 Mean : 23.1 Mean : 2.602 Mean : 16.06
## 3rd Qu.: 70.00 3rd Qu.: 0.0 3rd Qu.: 0.000 3rd Qu.: 0.00
## Max. :742.00 Max. :1012.0 Max. :508.000 Max. :576.00
##
## PoolArea PoolQC Fence MiscFeature MiscVal
## Min. : 0.000 Ex : 4 GdPrv: 118 Gar2: 5 Min. : 0.00
## 1st Qu.: 0.000 Fa : 2 GdWo : 112 Othr: 4 1st Qu.: 0.00
## Median : 0.000 Gd : 4 MnPrv: 329 Shed: 95 Median : 0.00
## Mean : 2.252 NA's:2909 MnWw : 12 TenC: 1 Mean : 50.83
## 3rd Qu.: 0.000 NA's :2348 NA's:2814 3rd Qu.: 0.00
## Max. :800.000 Max. :17000.00
##
## MoSold YrSold SaleType SaleCondition
## Min. : 1.000 Min. :2006 WD :2525 Abnorml: 190
## 1st Qu.: 4.000 1st Qu.:2007 New : 239 AdjLand: 12
## Median : 6.000 Median :2008 COD : 87 Alloca : 24
## Mean : 6.213 Mean :2008 ConLD : 26 Family : 46
## 3rd Qu.: 8.000 3rd Qu.:2009 CWD : 12 Normal :2402
## Max. :12.000 Max. :2010 (Other): 29 Partial: 245
## NA's : 1
## SalePrice
## Min. : 0
## 1st Qu.: 0
## Median : 34900
## Mean : 90492
## 3rd Qu.:163000
## Max. :755000
##
# Checking the percentage of missing values
sum(is.na(dataset)) / (nrow(dataset) *ncol(dataset))*100
## [1] 5.906386
# Checking duplicated rows
cat("The number of duplicated rows is", nrow(dataset) - nrow(unique(dataset)))
## The number of duplicated rows is 0
The definition of “meaningless” depends on your data and your intuition. A feature can lack importance because you know for sure that it is not going to have any impact on the final prediction (e.g., the ID of the house). In addition, there are features that could be relevant but present wrong, empty or incomplete values (this is typical when there has been a problem in the data-gathering process). For example, the feature Utilities takes a single value for virtually every house, so it is not going to offer any advantage for prediction.
We remove meaningless features and incomplete cases.
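A systematic way to flag near-constant features such as Utilities is caret's nearZeroVar (caret is already used later in this analysis via train). The sketch below runs on synthetic data, with invented column names, rather than on the house-prices set:

```r
# Sketch: flag near-zero-variance features with caret's nearZeroVar.
# 'almost_constant' mimics Utilities: one level covers 99% of the rows.
library(caret)
df <- data.frame(
  almost_constant = factor(c(rep("AllPub", 99), "NoSeWa")),
  informative     = rnorm(100)
)
nearZeroVar(df, names = TRUE)  # returns "almost_constant"
```

With the default thresholds (freqCut = 95/19, uniqueCut = 10), the 99:1 ratio between the two most frequent values flags the column, while the numeric column is kept.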
# Creating new features
# First and second floor summed
dataset<- as.data.table(dataset)
dataset$total_areaFlr <- dataset$X1stFlrSF + dataset$X2ndFlrSF +dataset$LowQualFinSF
# Porch areas summed
dataset$porch <- dataset$OpenPorchSF + dataset$EnclosedPorch + dataset$X3SsnPorch + dataset$ScreenPorch
# Areas of the basement summed
dataset$totalBsmtarea <- dataset$BsmtFinSF1 + dataset$BsmtFinSF2 + dataset$BsmtUnfSF
# Deleting variables we don't think are important (justification for each below)
dataset$Utilities <- NULL # All in one category
dataset$LotFrontage <- NULL # Too many missing values and little business sense
dataset$Street <- NULL # Most of the variables were in one category
dataset$Alley <- NULL # Most of them were null values
dataset$Condition2 <- NULL # Most of them were in one category
dataset$RoofStyle <- NULL # No business relevant value
dataset$RoofMatl <- NULL # No business relevant value
dataset$MasVnrType <- NULL # A lot of missing values
dataset$MasVnrArea <- NULL # Missing values; dropped together with MasVnrType
dataset$Exterior1st <- NULL # Replaced by ExterQual
dataset$Exterior2nd <- NULL # Replaced by ExterQual
dataset$BsmtFinSF1 <- NULL # In total area basement
dataset$BsmtFinSF2 <- NULL # In total area basement
dataset$BsmtUnfSF <- NULL # In total area basement
dataset$Heating <- NULL # Replaced by HeatingQC (heating quality and condition)
dataset$Electrical <- NULL # No business value
dataset$X1stFlrSF <- NULL # Reflected in total area
dataset$X2ndFlrSF <- NULL # Reflected in total area
dataset$LowQualFinSF<- NULL# Reflected in total area
dataset$GarageYrBlt <- NULL # Reflected in the year of construction and remodeling of the house
### Delete all the porch classes
dataset$OpenPorchSF <- NULL
dataset$EnclosedPorch <- NULL
dataset$X3SsnPorch <- NULL
dataset$ScreenPorch <- NULL
# Casting variables
dataset$BsmtQual<- as.character(dataset$BsmtQual)
dataset$BsmtQual[is.na(dataset$BsmtQual)] <- "No Basement"
dataset$BsmtQual<- as.factor(dataset$BsmtQual)
dataset$BsmtCond<- as.character(dataset$BsmtCond)
dataset$BsmtCond[is.na(dataset$BsmtCond)] <- "No Basement"
dataset$BsmtCond<- as.factor(dataset$BsmtCond)
dataset$BsmtExposure<- as.character(dataset$BsmtExposure)
dataset$BsmtExposure[is.na(dataset$BsmtExposure)] <- "No Basement"
dataset$BsmtExposure<- as.factor(dataset$BsmtExposure)
dataset$BsmtFinType1<- as.character(dataset$BsmtFinType1)
dataset$BsmtFinType1[is.na(dataset$BsmtFinType1)] <- "No Basement"
dataset$BsmtFinType1<- as.factor(dataset$BsmtFinType1)
dataset$BsmtFinType2<- as.character(dataset$BsmtFinType2)
dataset$BsmtFinType2[is.na(dataset$BsmtFinType2)] <- "No Basement"
dataset$BsmtFinType2<- as.factor(dataset$BsmtFinType2)
dataset$KitchenQual<- as.factor(dataset$KitchenQual)
dataset$Functional <- as.factor(dataset$Functional)
dataset$FireplaceQu<- as.character(dataset$FireplaceQu)
dataset$FireplaceQu[is.na(dataset$FireplaceQu)] <- "No fireplace"
dataset$FireplaceQu<- as.factor(dataset$FireplaceQu)
dataset$GarageType<- as.character(dataset$GarageType)
dataset$GarageType[is.na(dataset$GarageType)] <- "No garage"
dataset$GarageType<- as.factor(dataset$GarageType)
dataset$GarageFinish<- as.character(dataset$GarageFinish)
dataset$GarageFinish[is.na(dataset$GarageFinish)] <- "No garage"
dataset$GarageFinish<- as.factor(dataset$GarageFinish)
dataset$GarageQual<- as.character(dataset$GarageQual)
dataset$GarageQual[is.na(dataset$GarageQual)] <- "No garage"
dataset$GarageQual<- as.factor(dataset$GarageQual)
dataset$GarageCond<- as.character(dataset$GarageCond)
dataset$GarageCond[is.na(dataset$GarageCond)] <- "No garage"
dataset$GarageCond<- as.factor(dataset$GarageCond)
dataset$Fence<- as.character(dataset$Fence)
dataset$Fence[is.na(dataset$Fence)] <- "No fence"
dataset$Fence<- as.factor(dataset$Fence)
dataset$MiscFeature<- as.character(dataset$MiscFeature)
dataset$MiscFeature[is.na(dataset$MiscFeature)] <- "No miscfeature"
dataset$MiscFeature<- as.factor(dataset$MiscFeature)
dataset$PoolQC<- as.character(dataset$PoolQC)
dataset$PoolQC[is.na(dataset$PoolQC)] <- "No pool"
dataset$PoolQC<- as.factor(dataset$PoolQC)
# Replacing missing values and wrong values
# Add a 'TA' level so the pools with a recorded area but no quality can be assigned to it
# (reassigning all four levels here would silently rename 'No pool' to 'TA')
levels(dataset$PoolQC) <- c(levels(dataset$PoolQC), 'TA')
dataset$PoolQC[which(dataset$PoolArea == 368)] <- 'TA'
dataset$PoolQC[which(dataset$PoolArea == 444)] <- 'TA'
# Casting variables into categorical ones
dataset$MSSubClass <- as.factor(dataset$MSSubClass)
levels(dataset$MSSubClass)
## [1] "20" "30" "40" "45" "50" "60" "70" "75" "80" "85" "90"
## [12] "120" "150" "160" "180" "190"
dataset$MSZoning <- as.factor(dataset$MSZoning)
levels(dataset$MSZoning)
## [1] "C (all)" "FV" "RH" "RL" "RM"
dataset$LotConfig <- as.factor(dataset$LotConfig)
dataset$Neighborhood <- as.factor(dataset$Neighborhood)
dataset$Condition1 <- as.factor(dataset$Condition1)
dataset$BldgType <- as.factor(dataset$BldgType)
dataset$HouseStyle<- as.factor(dataset$HouseStyle)
dataset$ExterQual<- as.factor(dataset$ExterQual)
dataset$ExterCond<- as.factor(dataset$ExterCond)
dataset$Foundation<- as.factor(dataset$Foundation)
dataset$BsmtQual<- as.factor(dataset$BsmtQual)
dataset$HeatingQC <- as.factor(dataset$HeatingQC)
dataset$CentralAir <- as.factor(dataset$CentralAir)
dataset$PavedDrive <- as.factor(dataset$PavedDrive)
dataset$MoSold <- as.factor(dataset$MoSold)
dataset$YearBuilt <- as.factor(dataset$YearBuilt)
dataset$SaleType <- as.factor(dataset$SaleType)
dataset$SaleCondition <- as.factor(dataset$SaleCondition)
# Advanced factorization
dataset$LotShape <-recode(dataset$LotShape, 'Reg'='Regular', 'IR1'= 'Regular', 'IR2'='Irregular', 'IR3'='Irregular')
dataset$LandContour <- recode(dataset$LandContour, 'Lvl'='Flat', 'Bnk'= 'Flat', 'HLS'='Notflat', 'Low'='Notflat')
dataset$LandSlope <- recode(dataset$LandSlope, 'Gtl'='NoSlope', 'Mod'= 'Slope', 'Sev'='Slope')
dataset$OverallCond<-recode(dataset$OverallCond,'1'='Low','2'='Low','3'='Low','4'='Intermediate', '5'='Intermediate','6'='Intermediate High','7'= 'Intermediate High', '8'= 'High', '9'='High','10'='High')
dataset$OverallCond <- as.factor(dataset$OverallCond)
# Transforming to class factors
dataset$YearRemodAdd<-cut(dataset$YearRemodAdd,c(0,1960,1980,2000,2005,2009,2010))
dataset$YearRemodAdd <- as.factor(dataset$YearRemodAdd)
# Columns with missing values
sapply(dataset, function(x) sum(is.na(x)))
## Id MSSubClass MSZoning LotArea LotShape
## 0 0 4 0 0
## LandContour LotConfig LandSlope Neighborhood Condition1
## 0 0 0 0 0
## BldgType HouseStyle OverallQual OverallCond YearBuilt
## 0 0 0 0 0
## YearRemodAdd ExterQual ExterCond Foundation BsmtQual
## 0 0 0 0 0
## BsmtCond BsmtExposure BsmtFinType1 BsmtFinType2 TotalBsmtSF
## 0 0 0 0 1
## HeatingQC CentralAir GrLivArea BsmtFullBath BsmtHalfBath
## 0 0 0 2 2
## FullBath HalfBath BedroomAbvGr KitchenAbvGr KitchenQual
## 0 0 0 0 1
## TotRmsAbvGrd Functional Fireplaces FireplaceQu GarageType
## 0 2 0 0 0
## GarageFinish GarageCars GarageArea GarageQual GarageCond
## 0 1 1 0 0
## PavedDrive WoodDeckSF PoolArea PoolQC Fence
## 0 0 0 0 0
## MiscFeature MiscVal MoSold YrSold SaleType
## 0 0 0 0 1
## SaleCondition SalePrice total_areaFlr porch totalBsmtarea
## 0 0 0 0 1
# Create a function to get the mode (used for MSZoning and others)
getmode <- function(v) {
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}
dataset$MSZoning[is.na(dataset$MSZoning)] <- getmode(dataset$MSZoning)
# TotalBsmtSF
for (i in seq_along(dataset$TotalBsmtSF)) {
  if (dataset$BsmtCond[i] == "No Basement") {
    dataset$TotalBsmtSF[i] <- 0
  }
  if (is.na(dataset$TotalBsmtSF[i])) {
    dataset$TotalBsmtSF[i] <- mean(dataset$TotalBsmtSF, na.rm = TRUE)
  }
}
# BsmtFullBath
dataset$BsmtFullBath[is.na(dataset$BsmtFullBath)] <- 0
# BsmtHalfBath
dataset$BsmtHalfBath[is.na(dataset$BsmtHalfBath)] <- 0
# KitchenQual
dataset$KitchenQual[is.na(dataset$KitchenQual)] <- getmode(dataset$KitchenQual)
# Functional
dataset$Functional[is.na(dataset$Functional)]<- getmode(dataset$Functional)
# Garage cars
for (i in seq_along(dataset$GarageCars)) {
  if (dataset$GarageCond[i] == "No garage") {
    dataset$GarageCars[i] <- 0
  }
  if (is.na(dataset$GarageCars[i])) {
    # base R's mode() returns the storage mode, not the statistical mode,
    # so use the getmode helper defined above
    dataset$GarageCars[i] <- getmode(dataset$GarageCars[!is.na(dataset$GarageCars)])
  }
}
# Garage area
for (i in seq_along(dataset$GarageArea)) {
  if (dataset$GarageFinish[i] == "No garage") {
    dataset$GarageArea[i] <- 0
  }
  if (is.na(dataset$GarageArea[i])) {
    dataset$GarageArea[i] <- mean(dataset$GarageArea, na.rm = TRUE)
  }
}
# Sale type
dataset$SaleType[is.na(dataset$SaleType)]<- getmode(dataset$SaleType)
# BsmtFinSF1 & BsmtFinSF2 & BsmtUnfSF
dataset$totalBsmtarea[is.na(dataset$totalBsmtarea)] <- 0
sapply(dataset, function(x) sum(is.na(x)))
## Id MSSubClass MSZoning LotArea LotShape
## 0 0 0 0 0
## LandContour LotConfig LandSlope Neighborhood Condition1
## 0 0 0 0 0
## BldgType HouseStyle OverallQual OverallCond YearBuilt
## 0 0 0 0 0
## YearRemodAdd ExterQual ExterCond Foundation BsmtQual
## 0 0 0 0 0
## BsmtCond BsmtExposure BsmtFinType1 BsmtFinType2 TotalBsmtSF
## 0 0 0 0 0
## HeatingQC CentralAir GrLivArea BsmtFullBath BsmtHalfBath
## 0 0 0 0 0
## FullBath HalfBath BedroomAbvGr KitchenAbvGr KitchenQual
## 0 0 0 0 0
## TotRmsAbvGrd Functional Fireplaces FireplaceQu GarageType
## 0 0 0 0 0
## GarageFinish GarageCars GarageArea GarageQual GarageCond
## 0 0 0 0 0
## PavedDrive WoodDeckSF PoolArea PoolQC Fence
## 0 0 0 0 0
## MiscFeature MiscVal MoSold YrSold SaleType
## 0 0 0 0 0
## SaleCondition SalePrice total_areaFlr porch totalBsmtarea
## 0 0 0 0 0
We now need to detect skewness in the target variable. Let’s see the effect of skewness on a variable and plot it using ggplot. The way to get rid of skewness is to take the log (or log1p) of the feature’s values, to flatten the distribution. To reduce right skewness, take roots, logarithms or reciprocals (x to 1/x); this is the commonest problem in practice. To reduce left skewness, take squares, cubes or higher powers.
df <- rbind(data.frame(version="price",x=original_training_data$SalePrice),
data.frame(version="log(price+1)",x=log(original_training_data$SalePrice + 1)))
ggplot(data=df) +
facet_wrap(~version,ncol=2,scales="free_x") +
geom_histogram(aes(x=x), bins = 50)
We therefore transform the target variable by applying the log.
# Log transform the target for official scoring
dataset$SalePrice <- log1p(dataset$SalePrice)
We now build the list of numerical columns.
column_types <- sapply(names(dataset), function(x) {
  class(dataset[[x]])
})
numeric_columns <- names(column_types[column_types != "factor"])
And now, with that information, we need to calculate the skewness of each numeric column. We use the sapply method again, to compute the skewness of each column whose name is in the list of numeric_columns.
# skew of each variable
skew <- sapply(numeric_columns, function(x) {
  e1071::skewness(dataset[[x]], na.rm = TRUE)
})
hist(dataset$LotArea, breaks=40) # Right skewed, apply log (done before)
dataset$LotArea <- log1p(dataset$LotArea)
hist(dataset$OverallQual, breaks=40) # Not skewed
hist(dataset$TotalBsmtSF, breaks=40) # Right skewed
dataset$TotalBsmtSF <- log1p(dataset$TotalBsmtSF)
hist(dataset$GrLivArea, breaks=40) # Right skewed
dataset$GrLivArea <- log1p(dataset$GrLivArea)
hist(dataset$FullBath, breaks=40) # Not continuous variable
hist(dataset$HalfBath, breaks=40) # Not continuous variable
# The same logic applies to BedroomAbvGr, KitchenAbvGr, TotRmsAbvGrd, Fireplaces and GarageCars
hist(dataset$GarageArea, breaks=40) # Not skewed
hist(dataset$WoodDeckSF, breaks=40) # Right skewed
dataset$WoodDeckSF <- log1p(dataset$WoodDeckSF)
hist(dataset$PoolArea, breaks=40) # Right skewed
dataset$PoolArea<- log1p(dataset$PoolArea)
hist(dataset$MiscVal, breaks=40) # Right skewed
dataset$MiscVal<- log1p(dataset$MiscVal)
hist(dataset$YrSold, breaks=40) # No skeweness
hist(dataset$total_area, breaks=40) # Right skewness
dataset$total_area<- log1p(dataset$total_area)
hist(dataset$porch, breaks=40) # Right skewed
dataset$porch<- log1p(dataset$porch)
This is the section to give free rein to your imagination and create all the features that might improve the final result. Do not worry if you add some “uninformative” feature because it will be removed by the later feature selection process. Do not hesitate to consult the competition kernels (please cite anything you fork).
#Basement Bathrooms added with weights
dataset$bathrooms <- 0.25*dataset$BsmtHalfBath + 0.75*dataset$BsmtFullBath
dataset$BsmtFullBath <- NULL
dataset$BsmtHalfBath <- NULL
dataset$GarageCars <- NULL
To facilitate the data cleaning and feature engineering we merged train and test datasets. We now split them again to create our final model.
training_data <- dataset[1:1460,]
test <- dataset[1461:2919,]
# Checking outliers for different numeric variables
df<- as.data.frame(training_data)
ggplot(df, aes(x="",y=LotArea))+ geom_boxplot(width=0.1) +
theme(axis.line.x=element_blank(),axis.title.x=element_blank(), axis.ticks.x=element_blank(), axis.text.x=element_blank(),legend.position="none") # for a cleaner visualization
We don’t want these extreme cases to affect or bias the training process, so it is best to remove them. We can apply some metric (e.g., the Z-score) to detect and remove these points. The boxplot.stats function itself provides a way to identify them.
# $out includes the outliers
to_remove <- boxplot.stats(training_data$LotArea)$out
cat("Number of outliers", length(to_remove))
## Number of outliers 131
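The Z-score approach mentioned above can be sketched as follows (a minimal illustration on synthetic data; the cutoff of 3 standard deviations is a common convention, not a value taken from this analysis):

```r
# Z-score outlier detection: flag points further than 3 standard
# deviations from the mean.
set.seed(42)
x <- c(rnorm(100, mean = 50, sd = 5), 200)  # append one obvious outlier
z <- (x - mean(x)) / sd(x)
x[abs(z) > 3]  # returns only the outlier, 200
```

Note that the mean and standard deviation are themselves inflated by the outliers, which is why robust variants based on the median and the MAD are often preferred.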
Let’s do the same for the rest of the columns.
df <- as.data.frame(training_data)
outliercol <- NULL
for (col in names(df)) { # Go over all the features
if (is.numeric(df[[col]])){ # Take only the numerical features (dropping the template's col != "left" filter, which matches no column here)
print(ggplot(df, aes_string(y=col))+ geom_boxplot(width=0.1) + theme(axis.line.x=element_blank(),axis.title.x=element_blank(), axis.ticks.x=element_blank(), axis.text.x=element_blank(),legend.position="none")) # Boxplot
outliercol<- c(outliercol,col)
}
}
outliercol
## [1] "Id" "LotArea" "OverallQual" "TotalBsmtSF"
## [5] "GrLivArea" "FullBath" "HalfBath" "BedroomAbvGr"
## [9] "KitchenAbvGr" "TotRmsAbvGrd" "Fireplaces" "GarageArea"
## [13] "WoodDeckSF" "PoolArea" "MiscVal" "YrSold"
## [17] "SalePrice" "total_areaFlr" "porch" "totalBsmtarea"
## [21] "total_area" "bathrooms"
#View(outliercol)
# Outliers removal
# LotArea
training_data <- training_data[!training_data$LotArea %in% to_remove, ] # filter the training set, not the merged dataset
# TotRmsAbvGrd
plot(training_data$SalePrice, training_data$TotRmsAbvGrd)
training_data <- training_data[training_data$TotRmsAbvGrd <12,]
# OverallQual (not deleting outliers)
plot(training_data$SalePrice, training_data$OverallQual)
# GrLivArea (not deleting outliers since it is an area and large values are plausible)
plot(training_data$SalePrice, training_data$GrLivArea)
# TotalBsmtSF (keeping most values, removing only the extreme right tail)
plot(training_data$SalePrice, training_data$TotalBsmtSF)
training_data <- training_data[training_data$TotalBsmtSF <8,]
# Fullbath (no significant outliers and extreme values can have a high importance)
plot(training_data$SalePrice, training_data$FullBath)
# HalfBath (no significant outliers and extreme values can have a high importance)
plot(training_data$SalePrice, training_data$HalfBath)
# BedroomAbvGr (outliers can be helpful in explaining the sale price)
plot(training_data$SalePrice, training_data$BedroomAbvGr)
# KitchenAbvGr (outliers can be helpful in explaining the sale price)
plot(training_data$SalePrice, training_data$KitchenAbvGr)
# TotRmsAbvGrd (outliers can be helpful in explaining the sale price)
plot(training_data$SalePrice, training_data$TotRmsAbvGrd)
# Fireplaces (outliers can be helpful in explaining the sale price)
plot(training_data$SalePrice, training_data$Fireplaces)
# GarageArea (removing outliers that are too extreme)
plot(training_data$SalePrice, training_data$GarageArea)
training_data <- training_data[training_data$GarageArea <1200,]
# WoodDeckSF (no significant outliers)
plot(training_data$SalePrice, training_data$WoodDeckSF)
# PoolArea (no significant outliers, skewness already removed)
plot(training_data$SalePrice, training_data$PoolArea)
# MiscVal (no significant outliers)
plot(training_data$SalePrice, training_data$MiscVal)
# YrSold (categorical variable)
plot(training_data$SalePrice, training_data$YrSold)
# total area (outliers can be helpful in explaining the model)
plot(training_data$SalePrice, training_data$total_area)
# porch(outliers can be helpful in explaining the model- already skewed)
plot(training_data$SalePrice, training_data$porch)
# totalBsmtarea (removing the most extreme values)
plot(training_data$SalePrice, training_data$totalBsmtarea)
training_data <- training_data[training_data$totalBsmtarea <2500,]
splitdf <- function(dataframe, seed=NULL) {
  if (!is.null(seed)) set.seed(seed)
  index <- 1:nrow(dataframe)
  trainindex <- sample(index, trunc(length(index)/1.5))
  trainset <- dataframe[trainindex, ]
  testset <- dataframe[-trainindex, ]
  list(trainset=trainset, testset=testset)
}
splits <- splitdf(training_data, seed=1)
training <- splits$trainset
validation <- splits$testset
lm.model <- function(training_dataset, validation_dataset, title) {
# Create a training control configuration that applies a 5-fold cross validation
train_control_config <- trainControl(method = "repeatedcv",
number = 5,
repeats = 1,
returnResamp = "all")
# Fit a glm model to the input training data
this.model <- train(SalePrice ~ .,
data = training_dataset,
method = "glm",
metric = "RMSE",
preProc = c("center", "scale"),
trControl=train_control_config)
# Prediction
this.model.pred <- predict(this.model, validation_dataset)
this.model.pred[is.na(this.model.pred)] <- 0 # To avoid null predictions
# RMSE of the model
thismodel.rmse <- sqrt(mean((this.model.pred - validation_dataset$SalePrice)^2))
# Error in terms of the mean deviation between the predicted value and the price of the houses
thismodel.price_error <- mean(abs((exp(this.model.pred) -1) - (exp(validation_dataset$SalePrice) -1)))
# Plot the predicted values against the actual prices of the houses
my_data <- as.data.frame(cbind(predicted=(exp(this.model.pred) -1), observed=(exp(validation_dataset$SalePrice) -1)))
ggplot(my_data, aes(predicted, observed)) +
geom_point() + geom_smooth(method = "lm") +
labs(x="Predicted") +
ggtitle(paste(title, ' RMSE: ', format(round(thismodel.rmse, 4), nsmall=4), ' --> Price ERROR: ', format(round(thismodel.price_error, 0), nsmall=0), ' €', sep='')) +
scale_x_continuous(labels = scales::comma) +
scale_y_continuous(labels = scales::comma)
}
splitdf <- function(dataframe) {
  set.seed(123)
  index <- 1:nrow(dataframe)
  trainindex <- sample(index, trunc(length(index)/1.5))
  trainset <- dataframe[trainindex, ]
  testset <- dataframe[-trainindex, ]
  list(trainset=trainset, testset=testset)
}
Here we start the feature selection.
Let's first try a baseline including all the features, to evaluate the impact of the feature engineering.
lm.model(training, validation, "Baseline")
Since we had problems with the FSelector package, let's use chisq.test, included in base R, to measure the relationship between the categorical features (and only those) and the output.
# Compute the ChiSquared Statistic over the factor features ONLY
features <- names(training[, sapply(training, is.factor) & colnames(training) != 'SalePrice'])
chisquared <- data.frame(features, statistic = sapply(features, function(x) {
  chisq.test(training$SalePrice, training[[x]])$statistic
}))
# Plot the result, and remove those below the 1st quartile (an aggressive choice)
par(mfrow=c(1,2))
boxplot(chisquared$statistic)
bp.stats <- as.integer(boxplot.stats(chisquared$statistic)$stats) # Get the statistics from the boxplot
chisquared.threshold = bp.stats[2] # This element represents the 1st quartile.
text(y = bp.stats, labels = bp.stats, x = 1.3, cex=0.7)
barplot(sort(chisquared$statistic), names.arg = chisquared$features, cex.names = 0.6, las=2, horiz = T)
abline(v=chisquared.threshold, col='red') # Draw a red line at the 1st quartile
Now we can test whether this is a good move by removing any feature whose Chi-squared statistic against the output falls below the 1st quartile.
# Determine what features to remove from the training set.
training <- as.data.frame(training)
features_to_remove <- as.character(chisquared[chisquared$statistic < chisquared.threshold, "features"])
lm.model(training[!names(training) %in% features_to_remove], validation, "ChiSquared Model")
It is up to you to decide whether or not to apply this selection, based on the results achieved.
What to do with the numerical variables? We can always measure their relation with the outcome through Spearman’s correlation coefficient and remove those with a lower value. Let’s repeat the process we followed for the Chi-squared statistic, modifying the code to select only numerical features and to measure Spearman’s correlation.
# Compute the Spearman correlation over the numeric features ONLY
features <- names(training[, sapply(training, is.numeric) & colnames(training) != 'SalePrice'])
spearman <- data.frame(features, statistic = sapply(features, function(x) {
  cor(training$SalePrice, training[[x]], method='spearman')
}))
# Plot the result, and remove those below the 1st quartile (an aggressive choice)
par(mfrow=c(1,2))
boxplot(abs(spearman$statistic))
bp.stats <- boxplot.stats(abs(spearman$statistic))$stats # Get the statistics from the boxplot
text(y = bp.stats,
labels = sapply(bp.stats, function(x){format(round(x, 3), nsmall=3)}), # This is to reduce the nr of decimals
x = 1.3, cex=0.7)
spearman.threshold = bp.stats[2] # This element represents the 1st quartile.
barplot(sort(abs(spearman$statistic)), names.arg = spearman$features, cex.names = 0.6, las=2, horiz = T)
abline(v=spearman.threshold, col='red') # Draw a red line at the 1st quartile
Note: This might fail if you have null values in the numeric columns.
So, how good is our feature cleaning process? Let’s train the model with the new features, exactly as we did in the Chi Sq. section above.
# Determine what features to remove from the training set.
features_to_remove <- as.character(spearman[spearman$statistic < spearman.threshold, "features"])
lm.model(training[!names(training) %in% features_to_remove], validation, "Spearman Model")
Again, you have to decide whether this selection is worthwhile; the final decision is yours.
This part is equivalent to the Chi-squared one but uses another metric, so the code is very similar and is not included here.
Experiment now with Wrapper Methods and select what is the best possible compromise between the number of predictors and the results obtained.
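As one possible wrapper method, backward stepwise selection with stepAIC from the MASS package can be sketched as follows (a minimal illustration on synthetic data rather than on the house-prices set; the variable names are invented):

```r
# Backward stepwise selection: start from the full model and repeatedly
# drop the predictor whose removal most improves the AIC.
library(MASS)
set.seed(1)
df <- data.frame(x1 = rnorm(200), x2 = rnorm(200), x3 = rnorm(200))
df$y <- 2 * df$x1 + 0.5 * df$x2 + rnorm(200)  # x3 is pure noise
full    <- lm(y ~ ., data = df)
reduced <- stepAIC(full, direction = "backward", trace = FALSE)
names(coef(reduced))  # the informative x1 and x2 are retained
```

The same idea applies to the training set here, at a much higher computational cost than the filter methods above, since every candidate subset requires refitting the model.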
Finally, we will experiment with embedded methods.
For this exercise we are going to make use of the glmnet library. Take a look at the library to fit a glmnet model for Ridge Regression, using a grid of lambda values.
lambdas <- 10^seq(-3, 0, by = .05)
set.seed(121)
train_control_config <- trainControl(method = "repeatedcv",
number = 5,
repeats = 1,
returnResamp = "all")
ridge.mod <- train(SalePrice ~ ., data = training,
method = "glmnet",
metric = "RMSE",
trControl=train_control_config,
tuneGrid = expand.grid(alpha = 0, lambda = lambdas))
Note: This will fail if there are null values in the dataset. You have to complete the Hunting NAs section before executing this step.
The parameter alpha = 0 means that we want to use the Ridge Regression way of expressing the penalty in regularization. If you replace that by alpha = 1 then you get Lasso.
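The effect of alpha can also be seen by calling glmnet directly (a minimal sketch on synthetic data; the lambda value of 0.5 is chosen purely for illustration):

```r
# Ridge (alpha = 0) shrinks coefficients but keeps them all non-zero;
# Lasso (alpha = 1) can drive coefficients exactly to zero.
library(glmnet)
set.seed(1)
X <- matrix(rnorm(100 * 5), nrow = 100, ncol = 5)
y <- X[, 1] + rnorm(100)  # only the first predictor is informative
ridge <- glmnet(X, y, alpha = 0)
lasso <- glmnet(X, y, alpha = 1)
sum(coef(ridge, s = 0.5)[-1] != 0)  # all 5 predictors keep a non-zero weight
sum(coef(lasso, s = 0.5)[-1] != 0)  # the uninformative predictors are zeroed out
```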
Plotting the RMSE for the different lambda values, we can see the impact of this parameter in the model performance. Small values seem to work better for this dataset.
plot(ridge.mod)
Plotting the coefficients for different lambda values: as expected, the larger the lambda (the lower the norm), the smaller the coefficients of the features. However, as we can see at the top of the plot, there is no feature selection; i.e., the model always considers all 225 parameters.
plot(ridge.mod$finalModel)
ridge.mod.pred <- predict(ridge.mod, validation)
ridge.mod.pred[is.na(ridge.mod.pred)] <- 0
my_data <- as.data.frame(cbind(predicted=(exp(ridge.mod.pred) -1), observed=(exp(validation$SalePrice) -1)))
ridge.mod.rmse <- sqrt(mean((ridge.mod.pred - validation$SalePrice)^2))
ridge.mod.price_error <- mean(abs((exp(ridge.mod.pred) -1) - (exp(validation$SalePrice) -1)))
ggplot(my_data, aes(predicted, observed)) +
  geom_point() + geom_smooth(method = "glm") +
  labs(x = "Predicted", y = "Observed") +
  ggtitle(paste("Ridge", ' RMSE: ', format(round(ridge.mod.rmse, 4), nsmall = 4),
                ' --> Price ERROR: ', format(round(ridge.mod.price_error, 0), nsmall = 0),
                ' €', sep = '')) +
  scale_x_continuous(labels = scales::comma) +
  scale_y_continuous(labels = scales::comma)
Rank the variables according to the importance attributed by the model.
# Print, plot variable importance
plot(varImp(ridge.mod), top = 20) # 20 most important features
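The same ranking can also be inspected programmatically. A small sketch, assuming you want the names of the top-20 features as a character vector rather than a plot:

```r
# varImp() returns an object whose $importance slot is a data frame with
# one row per feature and an Overall importance score.
imp <- varImp(ridge.mod)$importance
imp$feature <- rownames(imp)

# Sort by importance and keep the 20 highest-ranked feature names
head(imp[order(-imp$Overall), "feature"], 20)
```

This is convenient if you later want to feed the top-ranked features into another selection step instead of reading them off the plot.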
The only thing that changes between Lasso and Ridge is the alpha parameter; the rest of the exercise is equivalent.
lambdas <- 10^seq(-3, 0, by = .05)
set.seed(121)
train_control_config <- trainControl(method = "repeatedcv",
                                     number = 5,
                                     repeats = 1,
                                     returnResamp = "all")
Lasso.mod <- train(SalePrice ~ ., data = training,
                   method = "glmnet",
                   metric = "RMSE",
                   trControl = train_control_config,
                   tuneGrid = expand.grid(alpha = 1, lambda = lambdas))
Note: This will fail if there are null values in the dataset. You have to complete the Hunting NAs section before executing this step.
Here alpha = 1 selects the Lasso (L1) penalty instead of the Ridge one.
Plotting the RMSE for the different lambda values, we can see the impact of this parameter on the model performance. Small values seem to work better for this dataset.
plot(Lasso.mod)
Plotting the coefficients for different lambda values. As before, the larger the lambda (lower norm), the smaller the coefficients of the features. Unlike Ridge, however, Lasso does perform feature selection: as lambda increases, more coefficients are driven exactly to zero, so fewer features remain in the model.
plot(Lasso.mod$finalModel)
Lasso.mod.pred <- predict(Lasso.mod, validation)
Lasso.mod.pred[is.na(Lasso.mod.pred)] <- 0
my_data <- as.data.frame(cbind(predicted=(exp(Lasso.mod.pred) -1), observed=(exp(validation$SalePrice) -1)))
Lasso.mod.rmse <- sqrt(mean((Lasso.mod.pred - validation$SalePrice)^2))
Lasso.mod.price_error <- mean(abs((exp(Lasso.mod.pred) -1) - (exp(validation$SalePrice) -1)))
ggplot(my_data, aes(predicted, observed)) +
  geom_point() + geom_smooth(method = "glm") +
  labs(x = "Predicted", y = "Observed") +
  ggtitle(paste("Lasso", ' RMSE: ', format(round(Lasso.mod.rmse, 4), nsmall = 4),
                ' --> Price ERROR: ', format(round(Lasso.mod.price_error, 0), nsmall = 0),
                ' €', sep = '')) +
  scale_x_continuous(labels = scales::comma) +
  scale_y_continuous(labels = scales::comma)
Rank the variables according to the importance attributed by the model.
# Print, plot variable importance
plot(varImp(Lasso.mod), top = 20) # 20 most important features
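To see the feature selection effect directly, you can count how many coefficients survive at the selected lambda. A minimal sketch against the fitted glmnet object:

```r
# Extract the coefficients of the final glmnet model at the tuned lambda.
# coef() returns a sparse matrix; zero entries are features Lasso dropped.
lasso.coefs <- coef(Lasso.mod$finalModel, s = Lasso.mod$bestTune$lambda)

# Number of features kept (minus one so the intercept is not counted)
sum(lasso.coefs != 0) - 1
```

Comparing this count against the full set of predictors makes the contrast with Ridge concrete: Ridge shrinks all coefficients but keeps them, while Lasso zeroes many of them out.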
# Final Submission
Based on your analysis, you have to decide which cleaning and feature engineering procedures make sense in order to create your final model. We split the original training data into train and validation sets to evaluate the candidate models; to generate the final submission, we should instead use all the data at our disposal. In addition, remember that we applied a log transformation to the target variable; to revert this transformation you have to use the exp function.
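To make the back-transformation concrete: if the target was transformed with log(price + 1) during preprocessing, then exp(prediction) - 1 recovers the original price scale. A quick round-trip check with an illustrative price:

```r
# Forward transformation as applied to the target during preprocessing
price <- 208500
log_price <- log(price + 1)

# Inverse transformation, as used on the model's predictions
exp(log_price) - 1  # recovers the original price
```

R's `log1p()` and `expm1()` compute the same pair of transformations with better numerical precision for small values, if you prefer them.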
Let’s see this with the code. Imagine that your final model is the ridge.mod that we have just created. In order to generate the final submission:
# Train the model using all the data at our disposal (train + validation)
final.model <- train(SalePrice ~ ., data = rbind(training, validation),
method = "glmnet",
metric = "RMSE",
trControl=train_control_config,
tuneGrid = expand.grid(alpha = 0, lambda = lambdas))
# Predict the prices for the test data (exp reverts the log transformation applied to the target variable)
final.pred <- as.numeric(exp(predict(final.model, test))-1)
final.pred[is.na(final.pred)]
## numeric(0)
hist(final.pred, main="Histogram of Predictions", xlab = "Predictions")
final_submission <- data.frame(Id = original_test_data$Id, SalePrice = final.pred)
write.csv(final_submission, file = "submission.csv", row.names = FALSE)
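Before uploading, a short sanity check on the written file can catch the most common submission mistakes (wrong row count, missing or non-positive prices). A minimal sketch:

```r
# Read the submission back and verify its basic shape and contents.
check <- read.csv("submission.csv")

stopifnot(nrow(check) == nrow(original_test_data),  # one row per test house
          !any(is.na(check$SalePrice)),             # no missing predictions
          all(check$SalePrice > 0))                 # prices must be positive
```

If any assertion fails, revisit the NA handling and the exp back-transformation before submitting.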
Note: This will fail if there are null values in the dataset. You have to complete the Hunting NAs section before executing this step.